Purpose:

Identify differences in the usage of Cyclistic bikes between casual users and annual members.

The question I need to answer is: “How do annual members and casual riders use Cyclistic bikes differently?” Cyclistic wants to answer this question as part of their goal to design marketing strategies aimed at converting casual riders into annual members. The company defines “Casual” and “Member” users as such:

Casual riders: are customers who puchase single-ride or full day passes.

Annual members: are customers who purchase annual memberships.

My initial hypothesis is that customers who purchase annual memberships (Members) are people who use the bikes as a commuting vehicle, while customer who buy single-ride or full day passes (Casual) are people looking to explore the city for leisure.

Note: This business case is based on real-world bike-share data from a company called Divvy, located in Chicago, Illinois. I am completing this case study to submit for my Coursera Certificate in Google Data Analytics. Cyclistic is the fictional name given to the company for the purposes of the certificate course

I’ll start by setting up my R environment, loading in the data, and exploring it. The following is the code I wrote to import the data. It downloads each zip file, extracts the csv file, and then imports the data into a data frame, excluding columns I won’t be using. After this, I use the skimr package to view a data summary of the data set. The whole uncompressed dataset is about 1.2 GB in size, so be sure you have enough memory available if you intend to load it this way.

# load libraries:
library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.1
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
# Create a vector of urls to download the files:
file_urls <-
  c(
    "https://divvy-tripdata.s3.amazonaws.com/202306-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202307-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202308-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202309-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202310-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202311-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202312-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202401-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202402-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202403-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202404-divvy-tripdata.zip",
    "https://divvy-tripdata.s3.amazonaws.com/202405-divvy-tripdata.zip"
  )

# create an empty dataframe to hold all the data:
rides <- data.frame()
# create a temp dataframe that will be used in each iteration of the loop below:
tempdf <- data.frame()

for(url in file_urls) {
  # download each zip file into a temp file
  temp <- tempfile()
  download.file(url, temp)
  
  # each zip file has a csv and a folder, we only want the csv file.
  # they are named similarly to the zip file
  filename <- paste(substring(url, 41, 62), "csv", sep = "")
  
  # extract the csv file we want from the zip file.
  temp <- unz(temp, 
              filename, 
              open = "", 
              encoding = getOption("encoding"))
  
  # read the csv file into a temporary data frame, only importing columns
  # that will be used for the remainder of the analysis:
  tempdf <- read_csv(temp,
                     col_select = c(
                       ride_id,
                       started_at,
                       ended_at,
                       member_casual),
                     show_col_types = FALSE)
  
  # add the rows from the temporary dataframe into the main dataframe
  rides <- rbind(rides, tempdf)
  
  # delete the temp file, wipe the temp dataframe
  unlink(temp)
  tempdf <- data.frame()
  
}

skim(rides)
Data summary
Name rides
Number of rows 5743278
Number of columns 4
_______________________
Column type frequency:
character 2
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 5743278 0
member_casual 0 1 6 6 0 2 0

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2023-06-01 00:00:44 2024-05-31 23:59:57 2023-09-28 20:55:19 4841289
ended_at 0 1 2023-06-01 00:02:56 2024-06-02 00:56:55 2023-09-28 21:09:16 4852976

Data Cleaning

Several rides are missing start or end station names, and the data set contains a number of rows with missing coordinates for the end of the ride (end_lat and end_lng). Some rides are missing station names and/or ids. This can be explained by the nature of the bike share service: not all rides start and/or end at a station. Customers can end a ride on some bikes without bringing the bike back to a dock. I chose not to remove these rows for this reason.

I added a column for trip duration. I also made separate columns from the ride start times to capture week day, day of month, month, and hour of day for trip departures, to make analysis simpler later on.

I noticed that there are several rides with negative ride durations. I removed these rows as they are clearly errors. I also removed any ride shorter than one minute, as the data source describes rides less than 60 seconds as: “potentially false starts or users trying to re-dock a bike to ensure it was secure” (from https://divvybikes.com/system-data). These made up a very small proportion of the data.

I also identified rides that were excessively long. My assumption is that these rides were actually cases where the ride was not ended when it should have been, either because the customer failed to dock the bike properly or an issue with the bike or dock systems caused the ride to not end properly. I do not believe these ride durations represent actual bike rides and so I removed any ride longer than 12 hours.

Below, I investigate the distribution of ride durations using a histogram and statistical summaries to come to the conclusion that removing rides longer than 12 hours does not lead to losing valuable data. The first histogram below is “zoomed in” to the lower values on the y axis so that the long “tail” of the distribution is visible.

The cleaning process is detailed in the code below:

# Cleaning

# Checking ride_id:
sum(str_equal(rides$ride_id, "")) # check for empty strings just in case
## [1] 0
# Check that all ride_ids are unique by counting unique ids (should match the 
# number of rows in the dataset:
print(paste("Number of unique rides: " , rides %>% 
  select(ride_id) %>% 
  unique() %>% 
  count()))
## [1] "Number of unique rides:  5743278"
print(paste("Total number of rides: " , rides %>% count()))
## [1] "Total number of rides:  5743278"
# Create a ride_duration column to capture how long each ride was:
rides <- rides %>% 
  mutate(ride_duration = (as.numeric(ended_at - started_at) / 60))

# remove rides with negative ride duration, and rides shorter than 60 seconds:
rides <- rides %>% 
  filter(ride_duration >= 1)

# Use skim to view a summary of ride_duration:
skim(rides, ride_duration)
Data summary
Name rides
Number of rows 5607110
Number of columns 5
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ride_duration 0 1 18.8 169.42 1 5.78 9.93 17.48 98489.07 ▇▁▁▁▁
# There are some rides that have longer than 24 hour duration, a small
# percentage. These excessively long rides don't really make any sense and are
# most likely erroneous. To determine a fair cutoff point, I'll use statistics
# and look at the histogram of ride durations:
rides %>% 
  ggplot(aes(x = ride_duration))+
  geom_histogram(bins = 300, color = "steelblue")+
  scale_x_continuous(breaks = seq(min(rides$ride_duration), 
                                  max(rides$ride_duration), 
                                  length.out = 40))+
  coord_cartesian(ylim = c(0,500))+ # zoom in to see the tail more clearly
  labs(title = "Histogram: Ride Duration Distribution",
       subtitle = "Zoomed in to see tail of distribution",
       x = "Ride Duration (seconds)", 
       y = "Frequency")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# The above histogram shows a steep drop off in observations in ride duration
# past a few hours.
# It's clear that the long duration rides make up a very small proportion of
# the data, and can be ignored. Keeping or removing them makes little difference
# to the overall statistics.

# I am going to assume it's safe to remove any ride over 12 hours.
# I am going to assume that any rides longer than 12 hours was actually someone
# forgetting to re-dock the bike, and I won't be including these rows.
# I would also point out that google maps estimates it would take a little over
# 2 hours to cycle from the northern border of Chicago to the southern border,
# so rides longer than 12 hours seems unlikely.
# Also, any ride longer than 3 hours starts to cost extra per minute.

# Count how many rides are longer than 12 hours (how many rows to be removed):
print(paste("Number of rides longer than 12 hours:", 
            rides %>% 
              select(ride_duration) %>% 
              filter(ride_duration > (60*12)) %>% 
              count()))
## [1] "Number of rides longer than 12 hours: 10525"
# remove all rides longer than 12 hours:
rides <- rides %>% 
  filter(ride_duration < (60*12))

# run that plot again:
rides %>% 
  ggplot(aes(x = ride_duration))+
  geom_histogram(bins = 100)+
  scale_x_continuous(breaks = seq(min(rides$ride_duration), 
                                  max(rides$ride_duration), 
                                  length.out = 10))+
  scale_y_continuous(labels = scales::comma)+
  #coord_cartesian(ylim = c(0,100))+
  labs(title = "Ride Duration Distribution",
       subtitle = "After removing rides >12 hours",
       x = "Ride Duration (minutes)", 
       y = "Frequency")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# This histogram still suggests that I could remove more of the long rides, and 
# in fact it might make sense to remove rides over 3 hours. I will keep the 12
# hour limit because I think it's realistic that a small percentage of users
# actually do ride the bikes for more than 3 hours, and I want them to be
# represented.

# Summary statistics after removing rides longer than 12 hours:
skim(rides, ride_duration)
Data summary
Name rides
Number of rows 5596585
Number of columns 5
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ride_duration 0 1 15.23 21.74 1 5.78 9.92 17.42 719.93 ▇▁▁▁▁
# Note that the mean hasn't changed much from before removing the 12 hour rides.

# Now I'll add some time columns to make analysis easier:
rides <- rides %>% 
  mutate(week_day = wday(rides$started_at)) %>% 
  mutate(day_of_month = mday(rides$started_at)) %>% 
  mutate(month = month(rides$started_at)) %>% 
  mutate(hour_of_day = hour(rides$started_at))

# Arrange the dataset by date:
rides <- rides %>% 
  arrange(rides$started_at)

# Make member_casual a labelled factor:
rides$member_casual <- factor(rides$member_casual, 
                              levels = c("member", "casual"),
                              labels = c("Member", "Casual"))

# Make week_day a labelled factor:
rides$week_day <- factor(rides$week_day,
                         levels = c(1,2,3,4,5,6,7),
                         labels = c("Sunday", "Monday", "Tuesday",
                                    "Wednesday", "Thursday", "Friday",
                                    "Saturday"))
# Make month a labelled factor:
rides$month <- factor(rides$month,
                      levels = c(1:12),
                      labels = c("Jan","Feb","Mar","Apr",
                                 "May","Jun","Jul","Aug",
                                 "Sep","Oct","Nov","Dec"))

# skim again to verify no NAs exist and see a summary of the dataset:
skim(rides)
Data summary
Name rides
Number of rows 5596585
Number of columns 9
_______________________
Column type frequency:
character 1
factor 3
numeric 3
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 5596585 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
member_casual 0 1 FALSE 2 Mem: 3607285, Cas: 1989300
week_day 0 1 FALSE 7 Sat: 885696, Thu: 836955, Fri: 828661, Wed: 811450
month 0 1 FALSE 12 Aug: 751675, Jul: 745982, Jun: 699663, Sep: 650915

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ride_duration 0 1 15.23 21.74 1 5.78 9.92 17.42 719.93 ▇▁▁▁▁
day_of_month 0 1 15.58 8.81 1 8.00 16.00 23.00 31.00 ▇▇▇▇▆
hour_of_day 0 1 14.07 4.92 0 11.00 15.00 18.00 23.00 ▁▃▅▇▃

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2023-06-01 00:00:44 2024-05-31 23:59:57 2023-09-29 01:36:12 4737212
ended_at 0 1 2023-06-01 00:02:56 2024-06-01 09:46:16 2023-09-29 01:45:19 4748351

The data has been checked for null or missing values, the ride ids have been checked for uniqueness, and rides with excessively long or negative duration have been removed. A few calculated columns have been added to make analysis simpler. The data is now ready for analysis.

Analysis

Using different plots of the variables, differences in bike usage were identified between Members and Casual users.

Both groups exibit seasonal usage differences.

Lets look at bike usage by month to see how each group uses the bikes during different times of the year:

# count of rides by month:

m_plot_1 <- rides %>% 
  group_by(month, member_casual) %>% 
  count() %>% 
  ggplot(aes(x = month,
             y = n,
             fill = n,
             text = n))+
  geom_col()+
  facet_wrap(~member_casual)+
  labs(title = "Ride Count by Month",
       subtitle = "The warmer months are most popular for both groups",
       caption = "Data from: https://divvybikes.com/system-data",
       x = "Month", 
       y = "Number of Rides",
       fill = "Number of Rides")+
  scale_y_continuous(labels = scales::comma)+ # format numbers on axis
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_fill_continuous(labels = scales::comma) # format numbers on legend
ggplotly(m_plot_1, tooltip = "text")
# compare Month and ride duration
m_plot_2 <- rides %>% 
  group_by(member_casual, month) %>% 
  summarise(avg_ride_duration = mean(ride_duration)) %>% 
  ggplot(aes(x = month, 
             y = avg_ride_duration,
             fill = avg_ride_duration,
             text = paste(month, "avg. ride duration:", avg_ride_duration))) +
  geom_col()+
  facet_wrap(~member_casual) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(
    title = "Average Ride Duration by Month",
    subtitle = "Members have more consistent ride durations than casual users",
    caption = "Data from: https://divvybikes.com/system-data",
    x = "Month",
    y = "Average ride duration (mins)",
    fill = "Avg. ride duration (mins)"
  )
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
ggplotly(m_plot_2, tooltip = "text")

The above two charts demonstrate the relationship between time of year and bike usage. The first chart demonstrates the increasing popularity of the bike service during the warmer months for both groups. The second chart shows that Members tend to ride the bikes for the same amount of time regardless of the time of year, while Casual users tend to increase ride duration more significantly during the warmer months.

The day of the week influences bike usage in both member and casual user groups.

Examine the following plots that show how different days of the week influence bike usage for each customer group:

# count of rides by week day:

wk_plot_1 <- rides %>%
  group_by(week_day, member_casual) %>% 
  count() %>% 
  ggplot(aes(x = week_day,
             y = n,
             fill = n,
             text = paste(week_day, "ride count:", n))) +
  geom_col() +
  labs(title = "Ride Count by Weekday", 
       subtitle = "Casual users use bikes more on the weekend",
       caption = "Data from: https://divvybikes.com/system-data",
       x = "Weekday", 
       y = "Number of Rides",
       fill = "Number of Rides")+
  scale_y_continuous(labels = scales::comma)+ # format numbers on axis
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_fill_continuous(labels = scales::comma)+ # format numbers on legend
  facet_wrap(~member_casual)
ggplotly(wk_plot_1, tooltip = "text")
# compare week day and ride duration

wk_plot_2 <- rides %>% 
  group_by(member_casual, week_day) %>% 
  summarise(avg_ride_duration = mean(ride_duration)) %>% 
  ggplot(aes(x = week_day, 
             y = avg_ride_duration,
             fill = avg_ride_duration,
             text = paste(week_day, "avg. ride duration:", avg_ride_duration)))+
  geom_col()+
  facet_wrap(~member_casual) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(
    title = "Average Ride Duration by Weekday",
    subtitle = "Member and Casual users compared",
    caption = "Data from: https://divvybikes.com/system-data",
    x = "Week day",
    y = "Average ride duration (mins)",
    fill = "Avg. ride duration (mins)"
  )
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
ggplotly(wk_plot_2, tooltip = "text")

As can be seen above, Members had a higher usage during the weekdays while Casual users had a higher usage during weekends. This implies that Members use the bikes during the typical Monday through Friday workweek while Casual users use the bikes for leisure on the weekends. Also, Members use the bikes with a very consisent ride duration. They are likely using it for commuting, and so their day to day trips are about the same amount of time. Casual users have higher durations overall, and the weekends have longer bike rides for both groups.

Time of day influences both user groups differently and highlights commuting trend.

The below charts examine how both groups use the bikes during different times of the day.

# count of rides by time of day
tm_plot_1 <- rides %>% 
  group_by(hour_of_day, member_casual) %>% 
  count() %>% 
  ggplot(aes(x=hour_of_day,
             y = n,
             fill = n,
             text = paste("Count:", n)))+
  geom_col()+
  scale_x_continuous("Time of Day", 
                     breaks = c(2, 6, 10, 14, 18, 22),
                     labels = c("2AM", "6AM", "10AM", "2PM", "6PM", "10PM"))+
  scale_y_continuous("Number of rides", 
                     labels = scales::comma)+
  scale_fill_continuous(labels = scales::comma)+ # format numbers on legend
  labs(title = "Bike rides by time of day",
       subtitle = "Comparision of Members and Casual users",
       caption = "Data from: https://divvybikes.com/system-data",
       fill = "Number of rides")+
  facet_wrap(~member_casual)
ggplotly(tm_plot_1, tooltip = "text")
# average ride duration for each hour of day:
tm_plot_2 <- rides %>% 
  group_by(hour_of_day, member_casual) %>% 
  summarize(avg_ride_duration = mean(ride_duration)) %>% 
  ggplot(aes(x = hour_of_day, 
             y = avg_ride_duration, 
             fill = avg_ride_duration,
             text = paste("Avg ride duration:", avg_ride_duration)))+
  geom_col()+
  facet_wrap(~member_casual,
             labeller = labeller())+
  scale_x_continuous("Time of Day", 
                     breaks = c(2, 6, 10, 14, 18, 22),
                     labels = c("2AM", "6AM", "10AM", "2PM", "6PM", "10PM"))+
  scale_y_continuous("Average ride duration (minutes)")+
  labs(
    title = "Average ride duration by time of day",
    subtitle = "Member and casual riders show differing ride durations by time of day",
    caption = "Data from: https://divvybikes.com/system-data",
    fill = "Avg. ride duration (mins)"
  )
## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.
ggplotly(tm_plot_2, tooltip = "text")

Members tend to ride the bikes more often during peak commuting times, such as 8AM and 5PM. Casual users have usage peaking around 5PM as well, but are more evenly spread throughout the rest of the day. Members tend to have consistent ride durations regardless of time of day, while Casual users tend to have longer rides between the hours of 10AM and 2PM.

Main takeaways:

The purpose of this analysis was to identify differences between the Member and Casual customer groups in an effort to target marketing efforts and convert more Casual users to annual Members. Given more data about each customer (for example, how many rides each customer took) could allow for more in-depth analysis of customer trends.
Given the data available, the following are the top three takeaways from my analysis:

1. Members tend to use bikes more on weekdays and during commuting hours.

As shown above, Members tend to use the bikes more during rush hour on weekdays (Mon-Fri at 8AM and 5PM). Members also take shorter bike rides that are more consistent length than Casual members, who take longer rides.

2. Casual users use bikes more often on weekends and in the afternoon/early evening.

Casual users are more likely to use the bikes for leisure riding on the weekends during midday to early evening. Casual users also take longer bike rides on average overall, especially on the weekends and in the middle of the day.

3. July - August is when bikes are used the most by both groups.

The warmer months bring a lot more usage of the bikes by both Members and Casual users, but casual users peak in usage even more noticeably.

Next steps and Recommendations

The usage patterns explored above identified that the key difference between the Member and Casual user groups was that Members seem to use the bikes primarily as a commuting vehicle.
To convert more Casual users into annual Members, the company should target their marketing efforts at Casual users while keeping the following in mind:

  1. Casual users who are looking for an alternative to other commuting vehicles or as an alternative to public transit would make an easy conversion. Highlight how flexible and easy using the service is to these people.
  2. Marketing near bus and train stations would be very effective at converting more customers who wish to use the service for commuting. Point out the time savings the customers could experience as well as the fun of riding a bike.
  3. The peak season for bike usage is in the warmer months and casual users are primarily using the bikes on the weekends. Emphasize how much more enjoyable it is to explore the city by bike vs. on foot or by car.

Citations:

Data source:
Divvy Bikes Divvy Bikes is Operated by Lyft Bikes and Scooters, LLC Data provided here: https://divvybikes.com/system-data under licence agreement here: https://divvybikes.com/data-license-agreement

R:
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

RStudio:
Posit team (2024). RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. URL http://www.posit.co/.

tidyverse package:
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, Françoi R, Grolemun G, Haye A, Henr L, Heste J, Kuh M, Pederse TL, Mille E, Bach SM, Müll K, Oo ,J, Robins ,D, Seid ,DP, Spi ,V, Takahas ,K, Vaugh ,D, Wil ,C, W ,K, Yutani ,H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

skimr package:
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.